Bankruptcy Prediction

Description

A company goes bankrupt when its liabilities exceed its assets. In most cases, bankruptcy models are based on financial indicators that describe the current overall condition or a specific area of financial health, such as profitability or indebtedness. This analysis examines the financial statements and market data of such companies and applies several models to predict bankruptcy. The goal is to find out how far in advance these models can predict that a company will get into financial distress, and which financial ratios enhance the accuracy of the bankruptcy prediction model.

Icon Legends

Inferences from Outcome
Additional Reads
Let's work!

Index

  1. Import Libraries
  2. Set Options
  3. Read Data
  4. Data Analysis and Preparation
  5. Modeling
  6. Model selection and interpretation
  7. Analysing False Positives

1. Importing required Libraries

2. Set Options

3. Reading the dataset

Let's import the dataset and start working.
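
A minimal sketch of the loading step. The filename and column names below are illustrative stand-ins (a tiny in-memory CSV replaces the real file); stripping header whitespace is a cheap safety step, since some published versions of this dataset carry stray leading spaces in column names:

```python
import io
import pandas as pd

# In the notebook this would be pd.read_csv("data.csv"); here a tiny
# in-memory sample stands in for the real file (names are illustrative).
csv_text = " Bankrupt?,ROA(C)\n0,0.42\n1,0.31\n"
df = pd.read_csv(io.StringIO(csv_text))
df.columns = df.columns.str.strip()   # guard against stray spaces in headers
print(df.shape, list(df.columns))
```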

4. Data Analysis and Preparation

Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

We will follow the steps mentioned below to analyse the dataset:-
1. Check dimensions of the dataframe in terms of rows and columns
2. Check the data types. Refer to the data definition to ensure your data types are correct; if not, make the necessary changes
3. Study summary statistics
4. Univariate Analysis: Visualise target and Study distributions of independent variables
5. Detect outliers
6. Check for missing and duplicate values
7. Bivariate Analysis: Study correlation
8. Analyze relationship between target variable and independent variables

4.1 Understanding the Dataset

4.1.1 Data Dimension

There are 6819 rows and 96 columns in the dataset

4.1.2 Checking Data Types

All features are of numeric data types (int or float)

4.1.3 Summary Statistics

Let's compute summary statistics on the dataset and remove insignificant variables.
The output shows the summary statistics of all the numeric variables: the mean, median (50%), minimum and maximum values, along with the standard deviation.
The counts of all the variables are equal, so there are probably no missing values. Also, the standard deviation of Net Income Flag is 0, which means this feature is insignificant for further analysis.
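
The zero-variance check described above can be sketched as follows, on a tiny stand-in frame (the column names are illustrative, not the dataset's):

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this runs on the full dataset
df = pd.DataFrame({
    "roa": [0.40, 0.50, 0.45, 0.42],
    "net_income_flag": [1, 1, 1, 1],     # constant column -> std == 0
    "debt_ratio": [0.10, 0.20, 0.15, 0.12],
})
stats = df.describe()
# A standard deviation of 0 means the feature carries no information
constant_cols = stats.columns[stats.loc["std"] == 0].tolist()
df = df.drop(columns=constant_cols)
print(constant_cols, df.shape)
```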

4.1.4 Analysing the Target Variable

Let's analyse the target variable "Bankrupt?".
There are far more healthy firms than bankrupt firms in the dataset
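
A quick sketch of the imbalance check, on a stand-in target column (the counts are illustrative, not the dataset's):

```python
import pandas as pd

# Stand-in target; the real "Bankrupt?" column is far larger but similarly skewed
y = pd.Series([0] * 95 + [1] * 5, name="Bankrupt?")
counts = y.value_counts()
print(counts.to_dict())               # class frequencies
print(round(counts[1] / len(y), 3))   # minority (bankrupt) share
```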

4.1.6 Distribution Analysis using measures of kurtosis, skewness and distribution plots

Let's analyse the distribution of each variable in the dataset using kurtosis and skewness, and then plot distribution plots for every variable.
As seen in the kurtosis and skewness analysis, most features are skewed and leptokurtic, so the median is a better measure of central tendency
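
The skewness/kurtosis screen can be sketched like this on synthetic data (one deliberately right-skewed column and one roughly symmetric one; thresholds are the usual rules of thumb, not the notebook's exact cut-offs):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in columns: one right-skewed, one roughly symmetric
df = pd.DataFrame({
    "skewed": rng.exponential(scale=1.0, size=1000),
    "symmetric": rng.normal(size=1000),
})
summary = pd.DataFrame({"skewness": df.skew(), "kurtosis": df.kurtosis()})
# Flag features that are both skewed (|skew| > 1) and leptokurtic (excess kurtosis > 0)
flagged = summary[(summary["skewness"].abs() > 1) & (summary["kurtosis"] > 0)]
print(flagged.index.tolist())
```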

4.1.6 Outlier Analysis using Boxplots

Let's analyse the outliers in each variable using boxplots.
There are a lot of outliers in most of the features; removing these outliers might improve the performance of our model(s)

4.1.7 Null and Duplicate values

There are no null or duplicate values in the dataset

4.1.8 Correlation Analysis

Correlation is a statistic that measures the degree to which two variables move in relation to each other.

In order to compute the correlation matrix, we perform the following:
1. Call the corr() function which will return the correlation matrix of numeric variables
2. Pass the correlation matrix to the heatmap() function of the seaborn library to plot the heatmap of the correlation matrix
3. Extract the list of highly correlated pairs (|corr| > 0.7)
There seems to be high to extreme correlation between some of the independent variables
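
Extracting the highly correlated pairs can be sketched as below, using a small synthetic frame (column names and correlations are illustrative); the upper-triangle mask ensures each pair is listed only once:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)
# Stand-in frame: b is almost a copy of a, c is independent
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
corr = df.corr()
# Keep only the upper triangle so every pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
pairs = pairs[pairs.abs() > 0.7]
print(list(pairs.index))
```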

4.1.9 Bivariate and Multivariate Analysis

In order to perform Bivariate and Multivariate Analysis, we perform the following:
1. Plot barplots to compare the medians of bankrupt and healthy firms for all features
2. Identify pairs with extremely high correlation
3. Plot a scatterplot of those pairs with hue as "Bankrupt?"
We observe multiple relationships in the bivariate and multivariate analysis; these are analysed in detail in the next section.

Data Analysis and EDA Inferences

Univariate Analysis

  1. We found that Net Income Flag and Liability-Assets Flag are insignificant for further analysis, as they carry little information.

  2. There are far more healthy firms than bankrupt firms in the target feature, pointing to a possible need for up-sampling.

  3. Most of the features are highly skewed and leptokurtic; the median is a better measure of central tendency, and transformation techniques need to be applied.

  4. Most features are rich in outliers, and some features also have incorrect values/mistakes at the far end. Removing or treating these outliers might improve the accuracy of our models. There are quite a few approaches we could take. One method that is gaining traction in the industry is the KNNImputer. We could also remove all the outliers, or impute them with the median values of the features. Alternatively, we could drop values that look like mistakes (e.g. greater than 1 for some of the financial ratios) and try different transformation techniques on the resultant data. Yet another option is to impute outliers with the class-wise medians of the column, i.e. replace outlier values of a feature among bankrupt firms with the median of that feature over bankrupt firms, and likewise for healthy firms.

  5. There are no null values in any of the columns
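
The class-wise median imputation idea from point 4 can be sketched as follows. The numbers are toy values, and the ">1" rule is a stand-in for the IQR fences used later:

```python
import pandas as pd

# Toy frame; 9.0 and 8.0 play the role of outliers
df = pd.DataFrame({
    "Bankrupt?": [0, 0, 0, 1, 1, 1],
    "ratio":     [0.4, 0.5, 9.0, 0.2, 0.3, 8.0],
})
# Median of the feature within each class, broadcast back to every row
class_medians = df.groupby("Bankrupt?")["ratio"].transform("median")
is_outlier = df["ratio"] > 1          # stand-in outlier rule
df["ratio"] = df["ratio"].mask(is_outlier, class_medians)
print(df["ratio"].tolist())  # [0.4, 0.5, 0.5, 0.2, 0.3, 0.3]
```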

Scaling and Transformation

As most of the data ranges between 0 and 1, we can use min-max scaling for columns outside this range if needed during modelling. Keep in mind that this does not treat outliers or skewness; it only brings everything to the same scale.

We can apply a log(x+1) transformation on columns with skewness greater than 1 or less than -1. We suggest log(x+1) because most of the values are very small to begin with, and applying a direct log might not yield desirable results.
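
A sketch of the log(x+1) step on a synthetic right-skewed column (numpy's `log1p` computes log(x+1) with good precision near zero, which suits the small values described above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.exponential(size=1000)})  # right-skewed stand-in

before = df["x"].skew()
skewed_cols = df.columns[df.skew().abs() > 1]
# log1p == log(x + 1): keeps zeros finite and compresses the right tail
df[skewed_cols] = np.log1p(df[skewed_cols])
print(round(before, 2), round(df["x"].skew(), 2))
```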

Bivariate and Multivariate Analysis

- Correlation Analysis

We found that a lot of the independent variables are correlated with each other, pointing to moderate to high multicollinearity; we identified 77 distinct pairs of independent features with correlation greater than 0.7 or less than -0.7.

- Ind-Dep Feature Bar Charts

We plotted bar charts with Bankrupt? on the x-axis and the median of each independent feature on the y-axis to see whether there were any clear relationships we could identify.

- Scatterplots Of Extremly Correlated Features

We plotted scatterplots for extremely correlated features (corr > 0.90) with hue as bankruptcy to see whether there are any evident clusters.

4.2 Data Preparation & Feature Engineering

Data preparation is a process of cleansing data to get it ready for modelling.

We will follow the steps mentioned below to prepare the dataset:-
1. Remove redundant columns based on Domain knowledge
2. Remove outliers using IQR method
3. Impute null values created due to outliers removal
4. Reduce multicollinearity using the Variance Inflation Factor
5. Transform skewed variables
6. Train - Test split

4.2.1 Removal of redundant features based on Domain knowledge

4.2.2 Outlier Removal using IQR

Let's remove outliers using the IQR method.
All the outliers have been removed. New outliers appear in the boxplots, which is expected once the distribution changes.
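
The IQR method above can be sketched on a single synthetic column. Values outside the 1.5 x IQR fences become NaN, which sets up the imputation step that follows (the injected values are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
s = pd.Series(rng.normal(size=500))
s.iloc[:5] = [10, -10, 12, -12, 15]   # inject obvious outliers

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Outliers become NaN so the later KNN imputation step can fill them in
cleaned = s.mask((s < lower) | (s > upper))
print(int(cleaned.isna().sum()))
```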

4.2.3 Null value treatment

Let's treat the null values resulting from the removal of outliers.

We will follow the steps mentioned below to treat null values:-
1. Find features with high null values and see if they can be dropped
2. Transform the skewed columns using square root transformation
3. Scale the data to get it ready for KNN imputation.
4. Impute null values using KNNImputer
5. Plot the medians to see if there's any deviation
Null value treatment summary and inferences:
1. There were 4 features with more than 20% null values, but not high enough to be dropped
2. We imputed the NaN values using KNN imputation with n_neighbors set to 5.
3. Plotting the medians showed that the treated dataset roughly overlaps the untreated dataset, indicating that the original distribution has been preserved.
Read more about the KNNImputer and why it is gaining traction in the Data Science community:

KNNImputer: A robust way to impute missing values (using Scikit-Learn)
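
A minimal KNNImputer sketch on a toy array (in the notebook this runs on the scaled feature matrix, with n_neighbors=5 rather than the 2 used here):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [2.0, 4.0],
              [1.5, np.nan]])
# Each NaN is replaced by the mean of that feature in the 2 nearest rows
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(np.isnan(X_imputed).sum())  # 0 -- no missing values remain
```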

4.2.4 Multicollinearity reduction

Let's find the features exhibiting high multicollinearity using the Variance Inflation Factor and remove them.
Columns with VIF > 5 have been removed. There are 48 columns remaining.
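
One common way to apply the VIF > 5 rule is to drop the worst feature and recompute, repeating until all VIFs pass. A sketch on synthetic data (assuming `statsmodels` is available; `a` and `b` are built to be nearly collinear):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
a = rng.normal(size=300)
X = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=300),  # nearly a copy of a
    "c": rng.normal(size=300),                  # independent
})

# Iteratively drop the feature with the highest VIF until all VIFs are <= 5
while True:
    vifs = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    if vifs.max() <= 5:
        break
    X = X.drop(columns=vifs.idxmax())
print(sorted(X.columns))
```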

4.2.5 Train-Test split

Let's perform the train-test split using the following steps:-
1. Specify Target and predictor variables.
2. Call train_test_split from sklearn and perform the split.
3. Print the shapes of all the splits.
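
The steps above can be sketched as follows on stand-in arrays. Passing `stratify=y` is an assumption on our part, but a sensible one with a target this imbalanced, since it keeps the bankrupt/healthy ratio identical in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)     # stand-in predictors
y = np.array([0] * 16 + [1] * 4)     # imbalanced stand-in target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape, int(y_test.sum()))
```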

5. Modeling

Our data analysis and preparation is complete; now we move on to the next stage, where we build predictive models using various algorithms and select the best one for deployment. Our aim is to maximise recall, as letting unhealthy companies slip by is more costly than classifying some healthy companies as unhealthy.

We will explore and optimise the following models:-
1. Logistic Regression (maximum likelihood estimation)
2. Decision Tree
3. Random Forest
4. Boosting
5. K-Nearest Neighbors
6. Naïve Bayes

5.1 Logistic Regression(MLE)

We will follow the steps below for Logistic Regression:-
1. Build a full Logistic Regression model using maximum likelihood estimation
2. Plot the Receiver Operating Characteristic (ROC) curve and get the Area Under the Curve (AUC)
3. Build a scorecard to see how metrics like precision, recall, F1-score and kappa perform at different thresholds.
4. Perform recursive feature elimination (RFE)
5. Build a model with the features that survived RFE.
6. Repeat steps 2 and 3 for the RFE model
7. Compare both models
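
Steps 1-3 can be sketched on synthetic data. sklearn's `LogisticRegression` stands in here for whichever MLE implementation the notebook uses, and the Youden's index threshold (the probability cut-off maximising TPR - FPR) is computed from the ROC curve:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic imbalanced stand-in for the prepared bankruptcy data
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

auc = roc_auc_score(y, proba)
fpr, tpr, thresholds = roc_curve(y, proba)
# Youden's J = TPR - FPR; its maximum gives a recall-friendly threshold
best_thresh = thresholds[np.argmax(tpr - fpr)]
print(round(auc, 3), round(float(best_thresh), 3))
```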

5.1.1 Full model

1. Our full Logistic Regression (MLE) model gave an ROC-AUC of 0.9556.
2. Maximum precision was 0.833333 at threshold 0.9
3. Maximum recall was 0.948718 at threshold 0.020431, obtained through Youden's index.
4. Maximum F1 was 0.483516 at threshold 0.2
5. Maximum kappa was 0.459989 at threshold 0.2

5.1.2 Recursive Feature Elimination : Logistic Regression

1. Our Logistic Regression model after RFE gave an ROC-AUC of 0.9593, barely above the full model.
2. Maximum observed precision was 0.833333 at threshold 0.9.
3. Maximum observed recall was 0.974359 at threshold 0.020815, obtained through Youden's index
4. Maximum observed F1 was 0.5 at threshold 0.3
5. Maximum observed kappa was 0.480183 at threshold 0.3
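
A minimal RFE sketch with sklearn on synthetic data (the notebook's feature count and number of features to keep will differ; RFE repeatedly drops the weakest features according to the logistic coefficients):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
# Keep the 8 strongest features according to the fitted coefficients
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8).fit(X, y)
kept = [i for i, keep in enumerate(rfe.support_) if keep]
print(len(kept))  # 8
```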

5.1.3 Model comparison: RFE vs Full

Our aim is to maximise recall, as letting unhealthy companies slip by is more costly than classifying some healthy companies as unhealthy. Following that logic, recall is maximised by the RFE model at threshold 0.020815, so our best logit model is the RFE model.

5.2 Decision Tree

We will follow the steps below for the Decision Tree:-
1. As decision trees are not sensitive to outliers, we will proceed with the raw data
2. Define the parameter search space for max_depth, min_samples_split, min_samples_leaf, max_features and max_leaf_nodes and tune with RandomizedSearchCV
3. Build a model with the best hyperparameters
4. Build a scorecard to see how metrics like precision, recall, F1-score and kappa perform at different thresholds.
5. Plot the confusion matrix at the best threshold
6. Add the model with the best threshold (maximum recall) to the scorecard
7. Look at feature importance
1. ROC-AUC after hyperparameter tuning was 0.9063.
2. Recall at the threshold obtained through Youden's index (0.074830) is 0.756410.
3. Net Income to Stockholder's Equity, a.k.a. return on equity, was the most important feature.
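
The tuning step can be sketched as below on synthetic data. The parameter ranges and `n_iter` are illustrative guesses, not the notebook's values; scoring on recall matches the stated objective:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
# Illustrative search space -- the notebook's ranges may differ
param_dist = {
    "max_depth": randint(2, 12),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_dist,
    n_iter=20,
    scoring="recall",   # we optimise recall, as argued above
    cv=5,
    random_state=0,
).fit(X, y)
print(search.best_params_)
```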

5.3 Random Forest

We will follow the steps below for the Random Forest classifier:-
1. As the Random Forest classifier is sensitive to high correlation, we will proceed with the cleaned data
2. Tune parameters like n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features and max_leaf_nodes using RandomizedSearchCV
3. Build a model with the best hyperparameters
4. Build a scorecard to see how metrics like precision, recall, F1-score and kappa perform at different thresholds.
5. Plot the confusion matrix at the best threshold
6. Add the model with the best threshold (maximum recall) to the scorecard
7. Look at feature importance
1. ROC-AUC after hyperparameter tuning was 0.946236.
2. Recall at the threshold obtained through Youden's index (0.050021) is 0.897436.
3. Net Value per Share and Borrowing Dependency are two of the most important features.
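
Reading the feature importances (step 7) can be sketched as follows; the data and column names are synthetic stand-ins, and the importances always sum to 1:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; the notebook uses the tuned forest and real column names
X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=2, random_state=0)
cols = [f"feat_{i}" for i in range(6)]
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = (pd.Series(rf.feature_importances_, index=cols)
                 .sort_values(ascending=False))
print(importances.round(3))
```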

5.4 Boosting

We will build the following boosting algorithms:-
1. Adaptive Boosting
2. Gradient Boosting
3. Extreme Gradient Boosting (XGBoost)

5.4.1 Adaptive Boosting

We will follow the steps below for AdaBoost:-
1. We will proceed with the cleaned data
2. Tune parameters like n_estimators and learning rate using RandomizedSearchCV
3. Build a model with the best hyperparameters
4. Build a scorecard to see how metrics like precision, recall, F1-score and kappa perform at different thresholds.
5. Plot the confusion matrix at the best threshold
6. Add the model with the best threshold (maximum recall) to the scorecard
7. Look at feature importance
1. ROC-AUC after hyperparameter tuning was 0.9399.
2. The optimal threshold is 0.336667, where we got the best recall (0.9743).
3. Non-industry Income and Expenditure/Revenue and Borrowing Dependency are the most important features

5.4.2 Gradient Boosting

We will follow the steps below for Gradient Boosting:-
1. We will proceed with the cleaned data to tune the model
2. First, tune the tree-based parameters by taking an initial guess for n_estimators and learning_rate.
3. Tune parameters like max_depth, min_samples_split, min_samples_leaf, max_features, max_leaf_nodes and random_state using RandomizedSearchCV.
4. With the best tree-based parameters from the first tuned model, tune the boosting parameters n_estimators and learning_rate.
5. Tune min_impurity_decrease to see if the best score improves.
6. Build a tuned model with the best hyperparameters
7. Build a metric scorecard to see how metrics like precision, recall, F1-score and kappa perform at different thresholds.
8. Make a Youden's table to get the best threshold from maximising Youden's index
9. Compare the thresholds obtained from the Youden's table and the metric scorecard
10. Plot the confusion matrix at the best threshold
11. Add the model with the best threshold to the scorecard
12. Look at feature importance
1. ROC-AUC for train and test after hyperparameter tuning is 0.9876 and 0.9523 respectively.
2. Recall at the threshold obtained through Youden's index (0.019109) is 0.876712.
3. The same recall occurs at threshold 0.02, but with better precision, F1-score and kappa
4. We therefore choose 0.02 as the best threshold.
5. Net Value per Share (B) is the most important feature.

5.4.3 Extreme Gradient Boosting (XGBoost)

We will follow the steps below for XGBoost:-
1. Even though XGBoost can handle unclean data, we will proceed with the cleaned data
2. Tune parameters like n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features and max_leaf_nodes using RandomizedSearchCV
3. Build a model with the best hyperparameters
4. Build a scorecard to see how metrics like precision, recall, F1-score and kappa perform at different thresholds.
5. Plot the confusion matrix at the best threshold
6. Add the model with the best threshold (maximum recall) to the scorecard
7. Look at feature importance
1. ROC-AUC for train and test after hyperparameter tuning is 0.97438 and 0.950119 respectively.
2. Recall at the thresholds obtained through Youden's index (0.020784 and 0.015142) is 0.910256 and 0.961538 respectively.
3. As the increase in recall outweighs the decrease in precision, we chose 0.015142
4. Degree of Financial Leverage and Borrowing Dependency are the most important features

5.5 K-Nearest Neighbors

We will follow the steps below for K-Nearest Neighbors:-
1. As KNN is sensitive to unscaled data, we will proceed with the cleaned data
2. Tune hyperparameters like n_neighbors and p
3. Build a model with the tuned hyperparameters
4. Build a scorecard to see how metrics like precision, recall, F1-score and kappa perform at different thresholds.
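
The scaling and tuning steps can be sketched together with a pipeline on synthetic data (the grid values are illustrative; putting the scaler inside the pipeline keeps the cross-validation leak-free):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=400, random_state=0)
# KNN is distance-based, so scale first; tune k and the Minkowski power p
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
grid = {
    "kneighborsclassifier__n_neighbors": [3, 5, 7, 9],
    "kneighborsclassifier__p": [1, 2],   # 1 = Manhattan, 2 = Euclidean
}
search = GridSearchCV(pipe, grid, cv=5, scoring="recall").fit(X, y)
print(search.best_params_)
```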

5.6 Naive Bayes

We will follow the steps below for Naive Bayes:-
1. As Naive Bayes is sensitive to unclean data, we will proceed with the cleaned data
2. Build a Gaussian Naive Bayes model
3. Build a scorecard to see how metrics like precision, recall, F1-score and kappa perform at different thresholds.
1. ROC-AUC for train and test is 0.92 and 0.93 respectively.
2. Recall at the best threshold (0.35) is 0.910256, with precision improving to 0.205202.
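
A sketch of scoring GaussianNB at a custom threshold on synthetic data (the 0.35 cut-off mirrors the one above; lowering the threshold can only keep or raise recall, at the cost of precision):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
proba = GaussianNB().fit(X, y).predict_proba(X)[:, 1]
# Lowering the cut-off below 0.5 trades precision for recall
recall_035 = recall_score(y, (proba >= 0.35).astype(int))
recall_050 = recall_score(y, (proba >= 0.50).astype(int))
print(round(recall_035, 3), round(recall_050, 3))
```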

6. Model Selection and Interpretation

We will follow the steps below for analysing and interpreting the models:-
1. Visualise recall, precision and kappa for all the models and compare them
2. Pick the best model, then visualise and interpret its coefficients
3. Identify which features are most important across all the tree-based models
As the RFE logit model gives the best recall with better precision, we choose this model.

Interpreting coefficients and significance

The red bars mean the feature is not statistically significant.

Visualising Important features from Tree based models

Net value per share and borrowing dependency are the most important features

7. Analysing False Positives

Let's analyse the common false positives.
The range in which these financial ratios fall together when misclassified can be considered a grey area.